The bioinformatics analysis pipeline nfcore/ampliseq is used for amplicon sequencing, supporting denoising of any amplicon and supports a variety of taxonomic databases for taxonomic assignment including 16S, ITS, CO1 and 18S.
Pipeline input was saved in folder input.
Sequencing data was provided in the samplesheet file
Samplesheet.tsv that is displayed below:
Metadata associated with the sequencing data was provided in
Metadata.tsv and is displayed below:
Cutadapt is trimming primer sequences from sequencing reads. Primer sequences are non-biological sequences that often introduce point mutations that do not reflect sample sequences. This is especially true for degenerated PCR primer. If primer trimming were to be omitted, artifactual amplicon sequence variants might be computed by the denoising tool or sequences might be lost due to being labelled as PCR chimera. Primers were trimmed using cutadapt and all untrimmed sequences were discarded. Sequences that did not contain primer sequences were considered artifacts. Less than 27.5% of the sequences were discarded per sample and a mean of 86.3% of the sequences per sample passed the filtering. Cutadapt results can be found in folder cutadapt.
Additional quality filtering can improve sequence recovery. Often it is advised trimming the last few nucleotides to avoid less well-controlled errors that can arise there. Reads were trimmed to a specific length and the length cutoff was automatically determined by the median quality of all input reads. Reads were trimmed before median quality drops below 25 and at least 75% of reads are retained, resulting in a trim of forward reads at 230 bp and reverse reads at 229 bp, reads shorter than this were discarded. Reads with more than 2 expected errors were discarded. Read counts passing the filter are shown in section ‘Read counts per sample’ column ‘filtered’.
Quality profiles:
Forward (left) and reverse (right) read quality stats for incoming data:
Forward (left) and reverse (right) read quality stats for preprocessed data:
Overall read quality profiles are displayed as heat map of the
frequency of each quality score at each base position. The mean quality
score at each position is shown by the green line, and the quartiles of
the quality score distribution by the orange lines. The red line shows
the scaled proportion of reads that extend to at least that position.
Original plots can be found in folder dada2/QC/ with names that end in
_qual_stats.pdf.
DADA2 performs fast and accurate sample inference from amplicon data with single-nucleotide resolution. It infers exact amplicon sequence variants (ASVs) from amplicon data with fewer false positives than many other methods while maintaining high sensitivity.
DADA2 reduces sequence errors and dereplicates sequences by quality filtering, denoising, read pair merging (for paired end Illumina reads only) and PCR chimera removal.
Read error correction was performed using estimated error rates, visualized below. Error rates for forward reads are at the left side and reverse reads are at the right side.
Estimated error rates are displayed for each possible transition. The
black line shows the estimated error rates after convergence of the
machine-learning algorithm. The red line shows the error rates expected
under the nominal definition of the Q-score. The estimated error rates
(black line) should be a good fit to the observed rates (points), and
the error rates should drop with increased quality. Original plots can
be found in folder dada2/QC/ with names that
end in .err.pdf.
Tracking read numbers through DADA2 processing steps for each sample. The following table shows the read numbers after each processing stage. Processing stages are: input - read pairs into DADA2, filtered - read pairs passed quality filtering, denoisedF - forward reads after denoising, denoisedR - reverse reads after denoising, merged - successfully merged read pairs, nonchim - read pairs in non-chimeric sequences (final ASVs).
Samples with unusual low reads numbers relative to the number of expected ASVs should be treated cautiously, because the abundance estimate will be very granular and might vary strongly between (theoretical) replicates due to high impact of stochasticity.
Following, the numbers of the table above are shown in stacked barcharts as percentage of DADA2 input reads. Stacked barchart of read pair numbers (denoisedF & denoisedR halfed, because each pair is split) per sample and processing stage:
Between 55.34% and 59.02% reads per sample (average 57%) were retained for analysis within DADA2 steps.
The proportion of lost reads per processing stage and sample should not be too high, totalling typically <50%. Samples that are very different in lost reads (per stage) to the majority of samples must be compared with caution, because an unusual problem (e.g. during nucleotide extraction, library preparation, or sequencing) could have occurred that might add bias to the analysis.
Finally, 347 amplicon sequence variants (ASVs) were obtained across
all samples. The ASVs can be found in dada2/ASV_seqs.fasta. And the
corresponding quantification of the ASVs across samples is in dada2/ASV_table.tsv. An extensive
table containing both was saved as dada2/DADA2_table.tsv. ASVs were
inferred for each sample independently.
Barrnap classifies the ASVs into the origin domain (including mitochondrial origin).
Barrnap classified 275 ( 79.25 %) ASVs as most similar to Bacteria,
72 ( 20.75 %) ASVs to Archea, 0 ( 0 %) ASVs to Mitochondria, 0 ( 0 %)
ASVs to Eukaryotes, and 0 ( 0 %) were below similarity threshold to any
kingdom.
rRNA classification results can be found in folder barrnap.
ASVs were filtered for bac (bac: Bacteria,
arc: Archaea, mito: Mitochondria,
euk: Eukaryotes) using the above classification. The number
of ASVs was reduced by 72 (20.75%), from 347 to 275 ASVs.
The following table shows read counts for each sample before and after filtering:
In average 21.55 % reads were removed, but at most 35.23 % reads per sample.
Data was processed using nf-core/ampliseq version 2.13.0dev, revision bf04a499cc (doi: 10.5281/zenodo.1493841) (Straub et al., 2020) of the nf-core collection of workflows (Ewels et al., 2020), utilising reproducible software environments from the Bioconda (Grüning et al., 2018) and Biocontainers (da Veiga Leprevost et al., 2017) projects.
Cutadapt (Marcel et al., 2011) trimmed primers and all untrimmed sequences were discarded. Sequences that did not contain primer sequences were considered artifacts. Less than 27.5% of the sequences were discarded per sample and a mean of 86.3% of the sequences per sample passed the filtering. Adapter and primer-free sequences were processed sample-wise (independent) with DADA2 (Callahan et al., 2016) to eliminate PhiX contamination, trim reads (before median quality drops below 25 and at least 75% of reads are retained; forward reads at 230 bp and reverse reads at 229 bp, reads shorter than this were discarded), discard reads with > 2 expected errors, correct errors, merge read pairs, and remove polymerase chain reaction (PCR) chimeras; ultimately, 347 amplicon sequencing variants (ASVs) were obtained across all samples. Between 55.34% and 59.02% reads per sample (average 57%) were retained. The ASV count table contained in total 4921 counts, at least 1030 and at most 1387 per sample (average 1230).
Barrnap (Seemann,
2013) filtered ASVs for bac (bac:
Bacteria, arc: Archaea, mito: Mitochondria,
euk: Eukaryotes), 72 ASVs were removed with less than
35.23% counts per sample (275 ASVs passed).
WARNING This methods section is lacking software versions, these can be found in folder pipeline_info file
software_versions.yml.
Technical information to the pipeline run are collected in folder pipeline_info, including software versions
collected at runtime in file software_versions.yml (can be
viewed with a text editor), all parameter settings in file
params_{date}_{time}.json (can be viewed with a text
editor), execution report in file
execution_report_{date}_{time}.html, execution trace in
file execution_trace_{date}_{time}.txt, execution timeline
in file execution_timelime_{date}_{time}.html, and pipeline
direct acyclic graph (DAG) in file
pipeline_dag_{date}_{time}.html.
This report (file summary_report.html) is located in
folder summary_report of the original pipeline results
folder. In this file, all links to files and folders are relative,
therefore hyperlinks will only work when the report is at its original
place in the pipeline results folder. Plots specifically produced for
this report (if any) can be also found in folder summary_report.
A comprehensive read count report throughout the pipeline can be
found in the base results folder in file
overall_summary.tsv.
Please cite the pipeline publication and any software tools used by the pipeline (see citations) when you use any of the pipeline results in your study.